Skip to content

Add asynchronous load method #10327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 144 commits into
base: main
Choose a base branch
from
Open

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented May 16, 2025

Adds an .async_load() method to Variable, which works by plumbing async get_duck_array all the way down until it finally gets to the async methods zarr v3 exposes.

Needs a lot of refactoring before it could be merged, but it works.

API:

  • Variable.load_async
  • DataArray.load_async
  • Dataset.load_async
  • DataTree.load_async
  • load_dataset?
  • load_dataarray?

async def load_async(self, **kwargs) -> Self:
# TODO refactor this to pull out the common chunked_data codepath

# this blocks on chunked arrays but not on lazily indexed arrays
Copy link
Contributor

@dcherian dcherian Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI dask has async compute but it seems hard to work in here :)

https://distributed.dask.org/en/stable/asynchronous.html

not has_zarr_v3, reason="zarr-python <3 did not support async loading"
)
@pytest.mark.asyncio
async def test_load_async(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this parallels test_load then lets keep this on the base class and use pytest.skip() on the netCDF subclasses. That way it's easy to keep the two in sync. Let's add a comment requesting future contributors to keep the two in sync

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I did think about doing this but skipping an inherited method seemed messy.

I made it work in cf1d127, though. Note that the MRO becomes important to get the correctly overridden test to be inherited (so it can be skipped). I think the cleaner solution here would be if we could simply ask the backends whether or not they support async indexing, which is an idea we also discussed for #10579 (comment).

@pytest.mark.asyncio
async def test_lazy_async_indexing(self) -> None:
v = Variable(dims=("x", "y"), data=LazilyIndexedArray(self.d))
await self.check_orthogonal_async_indexing(v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but we could combine the sync and async checks in one async function and just use that everywhere in this file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4f40792, though now I'm a little worried that this pattern

    async def check_orthogonal_indexing(self, v):
        expected = self.d[[8, 3]][:, [2, 1]]

        result = v.isel(x=[8, 3], y=[2, 1])
        assert np.allclose(result, expected)

        result = await v.isel(x=[8, 3], y=[2, 1]).load_async()
        assert np.allclose(result, expected)

might be automatically passing the second assertion by still being in-memory after the first assert?

Copy link
Contributor

@rabernat rabernat Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could parametrize the test over async and non async calls

@pytest.mark.parametrize("use_async", [True, False])
async def check_orthogonal_indexing(self, v, use_async):
        expected = self.d[[8, 3]][:, [2, 1]]

        if use_async:
            result = await v.isel(x=[8, 3], y=[2, 1]).load_async()
        else:
            result = v.isel(x=[8, 3], y=[2, 1])
        assert np.allclose(result, expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use assert not v._in_memory to be really sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the parametrization idea, though the syntax has to be a little messier because you can't parametrize normal functions in pytest. a074a25

@pytest.mark.parametrize("cls_name", ["Variable", "DataArray", "Dataset"])
@pytest.mark.parametrize(
"indexer, method, target_zarr_class",
[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏🏾 👏🏾 👏🏾

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work!

I left some minor comments that should be easy to address.

@TomNicholas TomNicholas added the plan to merge Final call for comments label Aug 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration tools dependencies Pull requests that update a dependency file enhancement io plan to merge Final call for comments run-upstream Run upstream CI topic-backends topic-indexing topic-NamedArray Lightweight version of Variable topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an asynchronous load method?
5 participants